Activity 6

http://bit.ly/2oOdqlV # voter_sample.csv
http://bit.ly/2I9CFaC # voter_tidy.csv

Data Integrity checks:

  1. How have the number of people registered changed over time?
  2. What is the distribution of ages within the full file? Within OMV?
  3. Did anyone register before age 18?
  4. Anyone double registered?

Data Exploration:

  1. What is the relationship between party identification and voter proportion?
  2. Do metro areas have different party breakdown than rural areas?
  3. What is the relationship between age and voter participation when controlling for party?
  4. How do those registered by OMV compare to those not registered by OMV?

voter_sample <- read_csv("http://bit.ly/2oOdqlV")
voter_tidy <- read_csv("http://bit.ly/2I9CFaC")
library(lubridate)
voter_tidy %>%
  mutate(birth_year = year(mdy(BIRTH_DATE)),
         age = 2016 - birth_year) %>%
  filter(PARTY_CODE %in% c("REP", "DEM", "NAV")) %>%
  sample_n(10000) %>%
  ggplot(aes(x = age, y = vote_prop, col = factor(PARTY_CODE))) +
  geom_point(alpha = .3) +
  stat_smooth() +
  xlim(c(0, 100))

Is this age?

Regression

Focus on the precinct

Let's go back pre-voter_sample and create a data set where

  • unit of observation is the precinct
  • we're only looking at the 2016 election.

Why would we do this?

  • ease of computing
  • merge with precinct-level information
  • now we're talking about voter turnout

Focus on the precinct I

library(DescTools)
wd <- getwd()
setwd("../../omv/or_voter_history")

filenames <- list.files(pattern = "2017.txt")
files <- lapply(filenames,read_delim, 
                col_types = paste(rep("c", 65), collapse = ""),
                delim = "\t", 
                escape_double = FALSE, 
                trim_ws = TRUE)

vs <- bind_rows(files)
rm(files)

Focus on the precinct II

or_precinct <- vs %>%
  select(PARTY_CODE, COUNTY, BIRTH_DATE, `11/08/2016`, PRECINCT) %>%
  mutate(age_at_election = (mdy(BIRTH_DATE) %--% 
                            mdy("11/08/2016")) %/% 
                            years(1)) %>%
  group_by(COUNTY, PRECINCT) %>%
  summarize(p_vote    = mean(`11/08/2016` == "YES", na.rm = TRUE),
            avg_age   = mean(age_at_election, na.rm = TRUE),
            p_NAV     = mean(PARTY_CODE == "NAV", na.rm = TRUE),
            party_div = Gini(as.factor(PARTY_CODE)),
            count     = n()) %>%
  mutate(metro = COUNTY %in% c("MULTNOMAH", "CLACKAMAS", 
                               "WASHINGTON", "LANE", "BEND"))
setwd(wd)
write_csv(or_precinct, "or_precinct.csv")

or_precinct <- read_csv("../../omv/or_precinct.csv")
head(or_precinct)
## # A tibble: 6 x 8
##   COUNTY PRECINCT p_vote avg_age p_NAV party_div count metro
##   <chr>  <chr>     <dbl>   <dbl> <dbl>     <dbl> <int> <lgl>
## 1 ACP    ACP       0       NaN   0         0       104 F    
## 2 BAKER  01        0.616    51.7 0.297     0.238  1426 F    
## 3 BAKER  02        0.599    49.0 0.318     0.238  1411 F    
## 4 BAKER  03        0.593    49.1 0.327     0.232  1546 F    
## 5 BAKER  04        0.626    50.9 0.302     0.233  1753 F    
## 6 BAKER  05        0.676    52.2 0.289     0.231  1729 F
or_precinct <- filter(or_precinct, COUNTY != "ACP")

What is the relationship between p_vote and avg_age?

ggplot(or_precinct, aes(x = avg_age, y = p_vote)) +
  geom_point(alpha = .3)

Some sensible filtering.

or_precinct <- filter(or_precinct, count > 10)

What is the relationship between p_vote and avg_age?

ggplot(or_precinct, aes(x = avg_age, y = p_vote)) +
  geom_point(alpha = .3)

What is the relationship between p_vote and avg_age?

m1 <- lm(p_vote ~ avg_age, data = or_precinct)
summary(m1)
## 
## Call:
## lm(formula = p_vote ~ avg_age, data = or_precinct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53950 -0.04771  0.00452  0.04967  0.28617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.254308   0.020624   12.33   <2e-16 ***
## avg_age     0.008721   0.000410   21.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07351 on 1328 degrees of freedom
## Multiple R-squared:  0.2541, Adjusted R-squared:  0.2536 
## F-statistic: 452.5 on 1 and 1328 DF,  p-value: < 2.2e-16

What is the relationship between p_vote and p_NAV?

ggplot(or_precinct, aes(x = p_NAV, y = p_vote)) +
  geom_point(alpha = .3)

What is the relationship between p_vote and p_NAV?

m2 <- lm(p_vote ~ p_NAV, data = or_precinct)
summary(m2)
## 
## Call:
## lm(formula = p_vote ~ p_NAV, data = or_precinct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38699 -0.02772  0.00288  0.03023  0.21346 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.018333   0.007014  145.18   <2e-16 ***
## p_NAV       -1.150296   0.024133  -47.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05169 on 1328 degrees of freedom
## Multiple R-squared:  0.6311, Adjusted R-squared:  0.6308 
## F-statistic:  2272 on 1 and 1328 DF,  p-value: < 2.2e-16

What is the relationship between p_vote and metro?

ggplot(or_precinct, aes(x = metro, y = p_vote)) +
  geom_boxplot()

What is the relationship between p_vote and metro?

m3 <- lm(p_vote ~ metro, data = or_precinct)
summary(m3)
## 
## Call:
## lm(formula = p_vote ~ metro, data = or_precinct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52716 -0.05034  0.01020  0.05635  0.30797 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.690327   0.002869 240.618   <2e-16 ***
## metroTRUE   0.001707   0.004932   0.346    0.729    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08511 on 1328 degrees of freedom
## Multiple R-squared:  9.02e-05,   Adjusted R-squared:  -0.0006627 
## F-statistic: 0.1198 on 1 and 1328 DF,  p-value: 0.7293

What is the relationship between p_vote and party_div?

ggplot(or_precinct, aes(x = party_div, y = p_vote)) +
  geom_point(alpha = .3)

What is the relationship between p_vote and party_div?

m4 <- lm(p_vote ~ party_div, data = or_precinct)
summary(m4)
## 
## Call:
## lm(formula = p_vote ~ party_div, data = or_precinct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52260 -0.04944  0.01019  0.05648  0.31357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.68035    0.01320  51.532   <2e-16 ***
## party_div    0.03757    0.04617   0.814    0.416    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08515 on 1324 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.0004999,  Adjusted R-squared:  -0.000255 
## F-statistic: 0.6622 on 1 and 1324 DF,  p-value: 0.4159

What is the relationship between p_vote and p_NAV when controlling for metro?

ggplot(or_precinct, aes(x = avg_age, y = p_vote, color = metro)) +
  geom_point(alpha = .3)

What is the relationship between p_vote and p_NAV when controlling for metro?

m5 <- lm(p_vote ~ p_NAV + metro, data = or_precinct)
summary(m5)
## 
## Call:
## lm(formula = p_vote ~ p_NAV + metro, data = or_precinct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37441 -0.02904  0.00272  0.03026  0.21032 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.029522   0.007188 143.236  < 2e-16 ***
## p_NAV       -1.168858   0.024048 -48.606  < 2e-16 ***
## metroTRUE   -0.017452   0.002985  -5.846 6.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05106 on 1327 degrees of freedom
## Multiple R-squared:  0.6404, Adjusted R-squared:  0.6398 
## F-statistic:  1181 on 2 and 1327 DF,  p-value: < 2.2e-16

What is the relationship between p_vote and p_NAV when controlling for avg_age?

library(plotly)
(p <- plot_ly(or_precinct, x = ~p_NAV, y = ~avg_age, 
        z = ~p_vote, colors = '#BF382A', opacity = .3, marker = list(size = 3)) %>%
  add_markers())

What is the relationship between p_vote and p_NAV when controlling for avg_age?

m6 <- lm(p_vote ~ p_NAV + avg_age, data = or_precinct)
summary(m6)
## 
## Call:
## lm(formula = p_vote ~ p_NAV + avg_age, data = or_precinct)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35766 -0.02714  0.00296  0.02929  0.18672 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.8427669  0.0206443  40.823   <2e-16 ***
## p_NAV       -1.0370032  0.0265969 -38.990   <2e-16 ***
## avg_age      0.0028627  0.0003178   9.009   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0502 on 1327 degrees of freedom
## Multiple R-squared:  0.6524, Adjusted R-squared:  0.6518 
## F-statistic:  1245 on 2 and 1327 DF,  p-value: < 2.2e-16

Classification

Example: Credit Default

library(ISLR)
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

Exploratory Data Analysis

Model Fitting

m1 <- glm(default ~ balance, data = Default, family = binomial)
coef(m1)
##   (Intercept)       balance 
## -10.651330614   0.005498917

Logistic Model

\[ \hat{P}(Y = 1) = \frac{1}{1 + e^{-(-10.65 + 0.0055 x_i)}} \]

Logistic Model

\[ \hat{P}(Y = 1) = \frac{1}{1 + e^{-(-10.65 + 0.0055 x_i)}} \]

summary(m1)$coef
##                  Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance       0.005498917 0.0002203702  24.95309 1.976602e-137

Where did those SEs come from?

Multiple Logisitic Regression

Add student as a predictor?

m2 <- glm(default ~ balance + student, data = Default, family = binomial)
summary(m2)$coef
##                  Estimate  Std. Error    z value      Pr(>|z|)
## (Intercept) -10.749495878 0.369191361 -29.116326 2.230782e-186
## balance       0.005738104 0.000231847  24.749526 3.136911e-135
## studentYes   -0.714877620 0.147519010  -4.846003  1.259734e-06

What's going on?

Activity

Oregon voters

Using http://bit.ly/2oOdqlV (voter_sample.csv), build a model to answer the following question:

What is the association between age and whether or not someone voted in the November 2016 election, after controlling for political party?